history: per-message cost, tokens, and latency tracking by akrentsel · Pull Request #16 · ankrgyl/exo

akrentsel · 2026-05-24T22:02:52Z

Summary

Adds an optional UsageRecord to every EventData::Messages event so we have a durable, per-message record of:

Model id (echoed by the provider, or the requested binding as fallback)
Raw token counts — prompt, completion, cached, cache-creation, reasoning
USD cost — computed at call time from the LiteLLM pricing database (loaded at runtime, see below)
TTFT + wall-clock duration — measured in the executor on both streaming and non-streaming paths
server_duration_ms — reserved (lingua does not yet surface a provider-reported processing time)
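
The TTFT and wall-clock measurements can be pictured with a small synchronous sketch. This is not the executor's code: the real paths are async and streaming, and `Timing` and `collect_with_timing` are hypothetical names used only for illustration. The idea is just to stamp the clock once when the first chunk arrives and again when the stream ends.

```rust
use std::time::{Duration, Instant};

/// Timing captured for one model call (hypothetical struct; the field
/// meanings mirror `ttft_ms` and `duration_ms` in the event JSON).
#[derive(Debug)]
struct Timing {
    /// Time until the first chunk arrived; `None` if the stream was empty.
    ttft: Option<Duration>,
    /// Total wall-clock time until the stream was exhausted.
    duration: Duration,
}

/// Drain a sequence of chunks, recording when the first chunk arrives
/// and when the whole sequence has been consumed.
fn collect_with_timing<I>(chunks: I) -> (Vec<I::Item>, Timing)
where
    I: IntoIterator,
{
    let start = Instant::now();
    let mut ttft = None;
    let mut out = Vec::new();
    for chunk in chunks {
        // Only the first chunk sets TTFT.
        if ttft.is_none() {
            ttft = Some(start.elapsed());
        }
        out.push(chunk);
    }
    let timing = Timing { ttft, duration: start.elapsed() };
    (out, timing)
}
```

On a non-streaming path the same shape would presumably degenerate to a single chunk, so TTFT and duration end up close to each other.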

Pricing: why runtime LiteLLM JSON, not a hardcoded table

OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.

The initial version hardcoded a small price table in Rust. That was a mistake: by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.

This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):

First use: HTTP fetch into `$XDG_CACHE_HOME/exo/litellm_prices.json`
Subsequent uses (within 24h): cached
After 24h: re-fetch, fall back to stale cache if network fails
If everything fails: empty table → tokens persist, `cost_usd: None` (no crash)
Env var overrides: `EXO_LITELLM_PRICES_PATH` (local file) and `EXO_LITELLM_PRICES_URL` (alternate source)
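
The fallback chain above can be sketched as a pure decision function. This is a rough illustration rather than the loader in this PR: `PriceSource` and `choose_source` are hypothetical names, the file and network I/O is abstracted into plain inputs, and the local override is assumed to load successfully whenever it is set.

```rust
use std::time::Duration;

/// Where the price table ends up coming from (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
enum PriceSource {
    LocalOverride,
    FreshCache,
    Fetched,
    StaleCache,
    Empty,
}

/// Cache files older than this count as stale (24h, per the list above).
const MAX_CACHE_AGE: Duration = Duration::from_secs(24 * 60 * 60);

/// Pick a source given what is available.
/// `cache_age` is `None` when no cache file exists; `fetch_ok` says whether
/// a network fetch would succeed (only consulted when the cache is not fresh).
fn choose_source(
    has_override: bool,
    cache_age: Option<Duration>,
    fetch_ok: bool,
) -> PriceSource {
    if has_override {
        return PriceSource::LocalOverride;
    }
    match cache_age {
        Some(age) if age <= MAX_CACHE_AGE => PriceSource::FreshCache,
        _ if fetch_ok => PriceSource::Fetched,
        // Fetch failed: fall back to whatever is on disk, however old.
        Some(_) => PriceSource::StaleCache,
        // Nothing on disk and no network: degrade to an empty table.
        None => PriceSource::Empty,
    }
}
```

Keeping the decision separate from the I/O makes each branch of the chain testable without touching the network or the filesystem.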

Cached-vs-fresh: per-provider accounting matters

Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:

Anthropic-family (anthropic, bedrock_converse, vertex_ai-anthropic_models, azure_ai): prompt_tokens is fresh input only. cache_read and cache_creation are separate. Bill all three additively.
OpenAI-family (openai, mistral, etc.): prompt_tokens is total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.

The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.
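
To make the two conventions concrete, here is a minimal sketch of the additive and inclusive formulas. It is not the `compute_cost_usd` implementation from this PR: `Rates`, `Usage`, `CacheStyle`, and `cost_usd` are hypothetical names, and any rates plugged in are illustrative rather than taken from LiteLLM.

```rust
/// Per-token USD rates for one model (hypothetical struct).
struct Rates {
    input: f64,
    output: f64,
    cache_read: f64,
    cache_creation: f64,
}

/// Token counts as reported by the provider.
struct Usage {
    prompt_tokens: u64,
    completion_tokens: u64,
    cached_tokens: u64,
    cache_creation_tokens: u64,
}

/// How a provider relates `prompt_tokens` to cached tokens.
#[derive(Clone, Copy)]
enum CacheStyle {
    /// Anthropic-family: `prompt_tokens` is fresh input only.
    Additive,
    /// OpenAI-family: `prompt_tokens` already includes cached tokens.
    Inclusive,
}

fn cost_usd(style: CacheStyle, rates: &Rates, u: &Usage) -> f64 {
    let fresh = match style {
        CacheStyle::Additive => u.prompt_tokens,
        // saturating_sub guards against a provider reporting cached > prompt.
        CacheStyle::Inclusive => u.prompt_tokens.saturating_sub(u.cached_tokens),
    };
    fresh as f64 * rates.input
        + u.cached_tokens as f64 * rates.cache_read
        + u.cache_creation_tokens as f64 * rates.cache_creation
        + u.completion_tokens as f64 * rates.output
}
```

As a worked example with made-up rates of $3/MTok input and $0.30/MTok cache read: 10,000 prompt tokens of which 8,000 are cached cost $0.0084 under the inclusive formula (2,000 fresh + 8,000 cached), but $0.0324 if the additive formula is applied by mistake. The gap widens as the cached share grows.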

Architecture

`exoharness::pricing` (pure data + math, no network, stays wasm-compatible):
- `PricingTable::from_json_str` parses LiteLLM's schema (silently ignores entries that don't have per-token rates, like image/embedding entries).
- `PricingTable::lookup` does exact + longest-prefix match (dated revisions like `claude-sonnet-4-6-20251022` resolve to `claude-sonnet-4-6`).
- `PricingTable::compute_cost_usd` does per-provider math.
`executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):
- `get_pricing_table()` returns `Arc<PricingTable>`; loads once per process, then zero-cost.
- Resolution: `EXO_LITELLM_PRICES_PATH` → cache (fresh) → fetch → stale cache → empty.
`BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.
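
The load-once behaviour of the loader can be sketched with the standard library alone. Note the substitution: this uses the synchronous `std::sync::OnceLock` instead of `tokio::sync::OnceCell`, and `PricingTable` here is a stand-in type alias and `load_table` a hypothetical placeholder for the real resolution chain.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};

/// Stand-in for the real table type: model name -> input rate (USD/token).
type PricingTable = HashMap<String, f64>;

/// Process-wide slot, filled on first use.
static TABLE: OnceLock<Arc<PricingTable>> = OnceLock::new();

/// Hypothetical placeholder for the path -> cache -> fetch -> empty chain.
fn load_table() -> PricingTable {
    HashMap::new()
}

/// Returns a shared handle; the load closure runs at most once per process,
/// and later calls only clone the `Arc`.
fn get_pricing_table() -> Arc<PricingTable> {
    Arc::clone(TABLE.get_or_init(|| Arc::new(load_table())))
}
```

The explicit-table constructors sidestep this global entirely, which is what keeps tests hermetic.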

What's in the event JSON now

```json
{
  "type": "messages",
  "messages": [...],
  "response_id": "01J...",
  "usage": {
    "model": "claude-sonnet-4-6",
    "prompt_tokens": 2847,
    "completion_tokens": 412,
    "prompt_cached_tokens": 12500,
    "cost_usd": 0.0146985,
    "ttft_ms": 842,
    "duration_ms": 3210
  }
}
```

All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.

Test plan

`cargo test --workspace` — 66 tests pass (was 60 on main; +11 pricing units, +1 end-to-end cost assertion, +2 backward-compat for UsageRecord JSON, -7 from refactoring the previous hardcoded-table tests away).
Pricing unit tests cover: Anthropic additive (no cache, cache hits, cache creation), OpenAI inclusive (with and without cache, subtraction semantics asserted), Bedrock regional surcharge, longest-prefix lookup, sample_spec doc entry skipped, unknown models, provider-style classification.
`pricing_loader::tests::local_path_override_is_honored` exercises the env-var override path.
End-to-end test `usage_record_is_persisted_with_computed_cost` uses `with_pricing_table` to inject an inline fixture — assertion is hermetic, no network dependency in CI.
`cargo test --package exo --test integration_chat -- --ignored` (mocked OpenAI + real local-process sandbox): unchanged, passes.
Loader exercised live: fetch succeeded, cache populated, subsequent calls cache-hit.

Not in scope (intentional)

`/cost` REPL command — persisting silently for now. UI surface is an easy follow-up.
Server-reported duration — lingua doesn't expose this today; field reserved for when it does.
Cache tier resolution — `UniversalUsage` collapses Anthropic's 5-min and 1-hour cache writes into one count. Cost defaults to the 5-minute rate. `cache_creation_input_token_cost_above_1hr` is parsed but unused.
Anthropic's aggregate Usage/Cost Admin API — separate org-level endpoint, aggregate-only, can't be tied to a specific message. Not useful for per-message tracking.

🤖 Generated with Claude Code

akrentsel · 2026-05-24T22:09:34Z

please don't review the code yet – I'm looking into a better way to get pricing details...

akrentsel · 2026-05-25T17:01:55Z

This is addressing #15

akrentsel · 2026-05-27T16:50:15Z

@ankrgyl this is ready to review, added tests

akrentsel · 2026-05-27T17:00:29Z

@ankrgyl this is ready to review, added tests

Alexsun1one

two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.

Alexsun1one · 2026-05-27T18:12:12Z

+        }
+        self.entries
+            .iter()
+            .filter(|(key, _)| model.starts_with(key.as_str()))


prefix-match without a separator boundary can fall through in surprising ways. e.g. gpt-4o-mini matches gpt-4o (intended) but also matches gpt-4 (not intended), since "gpt-4o-mini".starts_with("gpt-4") is true; if the more specific entry is missing, the longest-prefix winner is gpt-4, which is wrong.

if the goal is the claude-sonnet-4-6-20251022 -> claude-sonnet-4-6 case, tightening the filter to require model[key.len()..] be empty or start with - / : keeps that behavior and stops gpt-4o from sliding into gpt-4 when an entry is absent.
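
A minimal sketch of the tightened filter suggested above, assuming the table is a `HashMap` keyed by model name. `matches_on_boundary` and `lookup` are hypothetical names, not the functions in this PR.

```rust
use std::collections::HashMap;

/// True when `key` is a prefix of `model` and what remains is either empty
/// or begins with a separator (`-` or `:`).
fn matches_on_boundary(model: &str, key: &str) -> bool {
    match model.strip_prefix(key) {
        Some(rest) => rest.is_empty() || rest.starts_with('-') || rest.starts_with(':'),
        None => false,
    }
}

/// Longest boundary-aware prefix match over the table.
fn lookup<'a, V>(entries: &'a HashMap<String, V>, model: &str) -> Option<&'a V> {
    entries
        .iter()
        .filter(|(key, _)| matches_on_boundary(model, key.as_str()))
        .max_by_key(|(key, _)| key.len())
        .map(|(_, value)| value)
}
```

With only `gpt-4` and `claude-sonnet-4-6` in the table, `claude-sonnet-4-6-20251022` still resolves to `claude-sonnet-4-6`, while `gpt-4o-mini` finds nothing instead of inheriting the `gpt-4` rate. If a `gpt-4o` entry is present, `gpt-4o-mini` continues to resolve to it, because `-mini` starts on a separator.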

Alexsun1one · 2026-05-27T18:12:13Z

+        // two when cached==0 (the typical case).
+        match provider {
+            Some(p) if p.starts_with("anthropic") => Self::Additive,
+            Some(p) if p.starts_with("bedrock") => Self::Additive,


this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.

the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.

probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.
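
One way the allow-list variant could look, as a sketch only. The provider strings are the Anthropic-family ones named in the PR description; whether every one of them is Anthropic-only in LiteLLM's data is an assumption here, and `ADDITIVE_PROVIDERS` and `classify` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
enum CacheStyle {
    /// `prompt_tokens` excludes cached tokens (Anthropic-style).
    Additive,
    /// `prompt_tokens` includes cached tokens (OpenAI-style).
    Inclusive,
}

/// Exact provider strings treated as Anthropic-style (illustrative list,
/// copied from the PR description; not verified against LiteLLM).
const ADDITIVE_PROVIDERS: &[&str] = &[
    "anthropic",
    "bedrock_converse",
    "vertex_ai-anthropic_models",
    "azure_ai",
];

/// Exact match against the allow-list; anything else, including plain
/// `bedrock` and a missing provider, falls back to inclusive.
fn classify(provider: Option<&str>) -> CacheStyle {
    match provider {
        Some(p) if ADDITIVE_PROVIDERS.iter().any(|known| *known == p) => CacheStyle::Additive,
        _ => CacheStyle::Inclusive,
    }
}
```

Exact matching trades a little maintenance (new provider strings must be added by hand) for not silently sweeping unrelated Bedrock models into the additive formula.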

ankrgyl · 2026-05-30T18:06:49Z

+
+async fn try_load() -> anyhow::Result<PricingTable> {
+    // 1. Local path override — used by tests and air-gapped setups.
+    if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") {


i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure

maybe we add this to a skill in the repo somewhere?

Add an optional UsageRecord to every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, and TTFT + wall-clock duration. Fields are Option + skip_serializing_if and the record is boxed; legacy events still parse.

Cost is policy, computed in userspace, never by the trusted substrate:

- crates/cost: a standalone library with the price-table data model, a self-contained LiteLLM loader (explicit path/url, on-disk cache, degrade-to-empty), and per-provider math. Lookup is boundary-aware so dated revisions resolve without sliding a model onto a shorter neighbor's rate. Anthropic-family bills additively; everything else (including Bedrock, a TODO) is inclusive.
- exoharness stays minimal: it holds the UsageRecord schema and persists it verbatim, with no pricing code or dependency.
- Basic executor fills cost from a table loaded once at startup and injected via the CLI (--pricing-path / --pricing-url, env as fallback).
- The TypeScript harness (exoclaw) has its own self-contained cost port (@exo/model-runtime/cost) that owns its data loading (env override, own cache, own fetch) through the harness's normal config flow, so per-message cost works there with no dependency on the Rust loader or the trusted layer.

RLM is left unwired for now: its multi-call turn has different per-message accounting and is a separate follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Write up the cost-tracking design: cost as a userspace policy library (self-contained per language), the minimal substrate, per-provider math, boundary-aware lookup, the loader, and the trust framing (usage is agent-reported telemetry).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

akrentsel · 2026-06-12T17:42:23Z

rewriting in #54

akrentsel mentioned this pull request May 25, 2026

OpenAI Responses API: cache misses on every request (no prompt_cache_key sent) #24

Open

akrentsel force-pushed the feature/message-cost-tracking branch from 0de06df to 9933121 Compare May 26, 2026 22:56

akrentsel marked this pull request as ready for review May 27, 2026 16:31

akrentsel requested a review from ankrgyl May 27, 2026 16:49

akrentsel mentioned this pull request May 27, 2026

cli: /usage (alias /cost) REPL command #29

Open

3 tasks

Alexsun1one reviewed May 27, 2026

akrentsel mentioned this pull request May 27, 2026

fix: honor repl --model (#30) + send prompt_cache_key on Responses API (#24) #31

Draft

4 tasks

ankrgyl approved these changes May 30, 2026

akrentsel force-pushed the feature/message-cost-tracking branch 3 times, most recently from a0092fc to 74c28a7 Compare June 12, 2026 05:14

akrentsel and others added 2 commits June 12, 2026 05:28

akrentsel force-pushed the feature/message-cost-tracking branch from 74c28a7 to dd64f38 Compare June 12, 2026 05:29

akrentsel mentioned this pull request Jun 12, 2026

history: per-message cost, tokens, and latency tracking (on self-control) #54

Closed

akrentsel closed this Jun 12, 2026

akrentsel wants to merge 2 commits into main from feature/message-cost-tracking

3 participants
